ABSTRACT

While previous work on privacy-preserving serial data
publishing considers the scenario where sensitive values may
persist over multiple data releases, we find that no previous
work provides sufficient protection for sensitive values that
can change over time, which should be the more common case.
In this work, we study the privacy guarantee for such transient
sensitive values, which we call the global guarantee. We
formally define the problem of achieving this guarantee and
derive some of its theoretical properties. We show that the
anonymized group sizes used in data anonymization are a key
factor in protecting individual privacy in serial publication.
We propose two anonymization strategies that aim to minimize
the average group size and the maximum group size,
respectively. Finally, we conduct experiments on a medical
dataset to show that our method is highly efficient and
produces published data of very high utility.

1. INTRODUCTION 

Recently, there has been much study on the issues in 
privacy-preserving data publishing [21 IT51 [T21 HI Hoi [2"71 171 
1141 1331 [9j 1221 115] . Most previous works deal with privacy 
protection when only one instance of the data is published. 
However, in many applications, data is published at regular 
time intervals. For example, the medical data from a hospital 
may be published twice a year. Some recent papers
[19, 30, 8, 6, 23, 5] study the privacy protection issues for
multiple publications of multiple instances of the data. We
refer to this setting as serial data publishing.

Following the settings of previous works, we assume that 
there is a sensitive attribute which contains sensitive values 
that should not be linked to the individuals in the database. 
A common example of such a sensitive attribute is disease.
While some diseases such as flu or stomach virus may not be
very sensitive, some diseases such as chlamydia (a sexually
transmitted disease) can be considered highly sensitive. In
serial publishing of such a set of data, the disease values
attached to a certain individual can change over time.

A typical guarantee we want to achieve is that the probability
that an adversary can derive for the linkage of a person to a
sensitive value is no more than $1/\ell$. This is well known to
be a simple form of $\ell$-diversity [18]. This guarantee sounds
innocent enough for a single-release data publication. However,
when it comes to serial data publishing, the objective becomes
quite elusive and requires a much closer look. In serial
publishing, the individuals that are recorded in the data may
change, and the sensitive values related to individuals may
also change. We assume that the sensitive values can change
freely.

Let us consider a sensitive disease, chlamydia, which is a
sexually transmitted disease that is easily curable. Suppose
that there exist 3 records of an individual $o$ in 3 different
medical data releases. Typically, $o$ would not want anyone to
deduce with high confidence from these released data that he or
she has ever contracted chlamydia in the past. Here, the past
practically corresponds to one or more of the three data
releases. Therefore, if an adversary can deduce from these data
releases with high confidence that $o$ has contracted chlamydia
in one or more of the three releases, privacy is breached. To
protect privacy, we would like the probability of any
individual being linked to a sensitive value in one or more
data releases to be bounded from above by $1/\ell$. Let us call
this privacy guarantee the global guarantee and the value
$1/\ell$ the privacy threshold.

Though the global guarantee requirement seems to be quite 
obvious, to the best of our knowledge, no existing work has 
considered such a guarantee. Instead, the closest guarantee of 
previous works is the following: for each of the data releases,
$o$ can be linked to chlamydia with a probability of no more
than $1/\ell$. Let us call this guarantee the localized
guarantee. Would this guarantee be equivalent to the above
global guarantee?
In order to answer this question, let us look at an example. 

Consider two raw medical tables (or microdata) $T_1$ and $T_2$,
as shown in Figure 1, at time points 1 and 2, respectively.
Suppose that they contain records for the individuals
$o_1, o_2, o_3, o_4, o_5$. There are two kinds of attributes,
namely quasi-identifier (QID) attributes and sensitive
attributes. Quasi-identifier attributes are attributes that can
be used to identify an individual with the help of an external
source such as a voter registration list [21, 12, 15, 29]. In
this example, sex and zipcode are the quasi-identifier
attributes, while disease is the sensitive attribute. Attribute
id is used for illustration purposes and does not appear in the
published table. We assume that each individual owns at most
one tuple in each table at each time point. Furthermore, we
assume no additional background knowledge about the linkage of
individuals to diseases, and the sensitive values linked to
individuals can be freely updated from one release to the next.

Figure 1: A motivating example

Figure 2: Anonymization for $T_1$ and $T_2$

Assume that the privacy threshold is $1/\ell = 1/2$. In a
typical data anonymization [21, 12, 13, 29], in order to
protect individual privacy, the QID attributes of the raw table
are generalized or bucketized to form anonymized groups (AGs)
that hide the linkage between an individual and a sensitive
value. For example, table $T_1^*$ in Figure 2(a) is a
generalized table of $T_1$ in Figure 1. We generalize the zip
code of the first two tuples to 6500* so that they have the
same QID values in $T_1^*$. We say that these two tuples form
an anonymized group. It is easy to see that in each published
table $T_1^*$ or $T_2^*$, the probability of linking any
individual to chlamydia or flu is at most 1/2, which satisfies
the localized guarantee. The question is whether this satisfies
the global privacy guarantee with a threshold of 1/2.

For the sake of illustration, let us focus on the anonymized
groups $G_1$ and $G_2$ containing the first two tuples in
tables $T_1^*$ and $T_2^*$ in Figure 2, respectively. The
probability in serial publishing can be derived by possible
world analysis. There are four possible worlds for $G_1$ and
$G_2$ in these two published tables, as shown in Figure 3. Here
each possible world is one possible way to assign the diseases
to the individuals that is consistent with the published
tables. Therefore, each possible world is a possible assignment
of the sensitive values to the individuals at all the
publication time points for groups $G_1$ and $G_2$. Note that
an individual can be assigned different values at different
data releases, and the assignment in one data release is
independent of the assignment in another release.

Consider individual $o_2$. Among the four possible worlds,
three link $o_2$ to "chlamydia", namely $w_1$, $w_2$ and $w_3$.
In $w_1$ and $w_2$, the linkage occurs at $T_1$, and in $w_3$,
the linkage occurs at $T_2$. Thus, the probability that $o_2$
is linked to "chlamydia" in at least one of the tables is equal
to 3/4, which is greater than 1/2, the intended privacy
threshold. From this example, we can see that the localized
guarantee does not imply the global guarantee.
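
The possible world count above can be checked mechanically. The
following is a minimal sketch, not the authors' code: it
enumerates every assignment of a group's sensitive values to
its members, combines the releases independently, and counts
the worlds that link the target to the value at least once. The
group contents (`o1` as the other member, flu as the other
value) are assumptions, since the figures are not reproduced
here.

```python
from fractions import Fraction
from itertools import permutations, product


def breach_probability(releases, target, value):
    """Fraction of possible worlds linking `target` to `value` at least once.

    `releases` holds one (individuals, sensitive_values) pair per published
    table, describing the anonymized group that contains `target`.
    """
    per_release = []
    for individuals, values in releases:
        # Each distinct arrangement of the group's values is one assignment.
        arrangements = set(permutations(values))
        per_release.append([dict(zip(individuals, a)) for a in arrangements])
    # A possible world picks one assignment per release, independently.
    worlds = list(product(*per_release))
    linked = sum(
        any(assignment.get(target) == value for assignment in world)
        for world in worlds
    )
    return Fraction(linked, len(worlds))


# Assumed contents of G1 and G2 (hypothetical; only the sizes and the
# single chlamydia tuple per group are taken from the text).
g1 = (("o1", "o2"), ("flu", "chlamydia"))
g2 = (("o1", "o2"), ("chlamydia", "flu"))
print(breach_probability([g1, g2], "o2", "chlamydia"))  # 3/4
```

Each release on its own yields 1/2, so the localized guarantee
holds while the combined probability exceeds the threshold.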

In this paper, we show that in order to ensure the global
guarantee, the sizes of the anonymized groups need to be bigger
than those needed for the localized guarantee. In the above
example, we can use anonymized groups of size 4, as shown in
Figure 4. There will be $4! \times 4!$ possible worlds. It is
easy to see that 3/4 of the possible worlds do not assign
chlamydia to $o_2$ in the first release, 3/4 of them do not
assign chlamydia to $o_2$ in the second release, and
$3/4 \times 3/4 = 9/16$ of the possible worlds do not assign
chlamydia to $o_2$ in either release. The remaining possible
worlds assign chlamydia to $o_2$ in at least one of the two
releases. Hence, the privacy breach probability is
$1 - 9/16 = 7/16 < 1/2$.

Figure 4: Anonymization for global guarantee
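
The size-4 case can be verified by brute force in the same
spirit. This sketch assumes each group holds one chlamydia
tuple and three other distinct values (the placeholders `v2`,
`v3`, `v4` are hypothetical), which matches the $4! \times 4!$
count stated above.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical group contents: one chlamydia tuple among four distinct values.
values = ("chlamydia", "v2", "v3", "v4")
assignments = list(permutations(values))  # 4! = 24 assignments per release
O2 = 1                                    # position of o2 inside the group

# Releases are independent, so a world is any pair of assignments.
worlds = [(first, second) for first in assignments for second in assignments]
breached = sum(
    1 for first, second in worlds
    if first[O2] == "chlamydia" or second[O2] == "chlamydia"
)
p_global = Fraction(breached, len(worlds))
print(len(worlds), p_global)  # 576 7/16
```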

The contributions of this paper include the following: We 
point out the problem of privacy breach that arises with lo- 
calized guarantee and propose to study the problem of global 
guarantee in privacy preserving serial data publishing. We 
formally analyze the privacy breach with transient sensitive 
values. Useful properties related to the anonymization un- 
der the global guarantee are derived. These properties are 
related to the anonymized group sizes. Typically group sizes 
greater than that required for the localized guarantee will be 
needed to attain the global guarantee. These properties are
then leveraged in the proposal of new anonymization strategies
that can minimize the information loss. We have also conducted
extensive experiments with a real medical dataset to verify our
techniques. The results show that our methodology is very
promising for real-world applications.

The rest of this paper is organized as follows. Section 2
surveys the previous related works. Section 3 contains our
problem definition. Section 4 describes a general formula for
the breach probability. Section 5 discusses some key properties
of this problem. Section 6 describes our methodology for
privacy protection. Section 7 suggests a possible
implementation. Section 8 is an empirical study. Section 9
concludes our work and points out some possible future
directions.

2. RELATED WORK 

Here, we summarize the previous works on the problem of
privacy-preserving serial data publishing. $k$-anonymity has
been considered in [8] and [19] for serial publication allowing
only insertions, but they do not consider the linkage
probabilities to sensitive values. The work in [23] considers
sequential releases for different attribute subsets of the same
dataset, which is different from our definition of serial
publishing.

There are some more related works that attempt to avoid the
linkage of individuals to sensitive values. Delayed publishing
is proposed in [6] to avoid problems of insertions, but
deletions and updates are not considered. While [30] considers
both insertions and deletions, both [6] and [30] make the
assumption that when an individual appears in consecutive data
releases, the sensitive value for that individual is not
changed. As pointed out in [5], this assumption is not
realistic. Also, the protection in [30] is record-based and not
individual-based. This is quite problematic: in our running
example, there are two records for one individual $o_2$, namely
$t_1$ in table $T_1$ and $t_2$ in table $T_2$ (note that $T_1$
and $T_2$ need not be consecutive releases, so the sensitive
value linked to $o_2$ can change even if we adopt the above
unrealistic assumption in [6, 30]). If we consider just tuple
$t_1$, then there are only 2 possible worlds where $t_1$ is
linked to chlamydia in Figure 3, namely $w_1$ and $w_2$. If we
consider just tuple $t_2$, there are also only 2 possible
worlds linking it to chlamydia, namely $w_1$ and $w_3$. Hence,
$T_1^*$ and $T_2^*$ satisfy the record-based requirement of
[30] if the risk threshold is 0.5. In fact, these are possible
tables generated by the mechanism proposed in [30]. However, we
have shown that this anonymization does not provide the
expected protection for the individuals.

Figure 3: Possible worlds for $G_1$ and $G_2$

The $\ell$-scarcity model is introduced in [5] to handle the
situations where some data may be permanent, so that once an
individual is linked to such a value, the linkage will remain
in subsequent releases whenever the individual appears (not
limited to consecutive releases only). However, for transient
sensitive values, [30] and [5] adopt the following principle.

Principle 1 (Localized Guarantee). For each re- 
lease of the data publication, the probability that an individual 
is linked to a sensitive value is bounded by a threshold. 

However, we have seen in the example in the previous sec- 
tion that this cannot satisfy the expected privacy require- 
ment. Hence, we consider the following principle. 

Principle 2 (Global Guarantee). Over all the pub- 
lished releases, the probability that an individual has ever been 
linked to a sensitive value is bounded by a threshold. 

Although the privacy guarantee is the most important data 
publication criterion, the published data must also provide a 
reasonable level of utility so that it can be useful for ap- 
plications such as data mining or data analysis. Utility is 
a tradeoff for the privacy guarantee since anonymization of 
data introduces information loss. There are different defi- 
nitions of utility in the existing literature. Here, we briefly 
describe some common definitions. 

The anonymized group sizes have been considered in utility 
metrics. The average group size is considered in [16]. In [3], 
the discernability model assigns a penalty to each tuple t as 
determined by the square of the size of the anonymized group 
for t. In [12], the normalized average anonymized group size 
metric is proposed, which is given by the total number of 
tuples in the table divided by the product of the total number 
of anonymized groups and a value $k$ (for $k$-anonymity). Here,
the best case occurs when each group has size k. 
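
The three group-size metrics can be sketched as small
functions. This is an illustration only: the function names are
hypothetical, and the discernibility total is taken here as the
sum of squared group sizes, which is one common reading of the
per-tuple penalty and is an assumption rather than a definition
taken from [3].

```python
from fractions import Fraction


def average_group_size(sizes):
    """Mean size of the anonymized groups."""
    return Fraction(sum(sizes), len(sizes))


def discernibility_penalty(sizes):
    """Assumed form: sum over groups of the squared group size."""
    return sum(size * size for size in sizes)


def normalized_average_group_size(sizes, k):
    """Total tuples divided by (number of groups * k); 1 is the best case."""
    return Fraction(sum(sizes), len(sizes) * k)


sizes = [2, 2, 4]
print(average_group_size(sizes))                # 8/3
print(discernibility_penalty(sizes))            # 24
print(normalized_average_group_size(sizes, 2))  # 4/3
```

With every group at size exactly `k`, the normalized metric
equals 1, matching the best case described above.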

Other works [29, 31, 26] consider categorical data that comes
with a taxonomy, so that the information loss is measured with
respect to the structure of the taxonomy when data are
generalized from the leaf nodes to higher levels in the
taxonomy. Both [11] and [28] measure utility by comparing the
data distributions before and after anonymization.



Recently, [20] and [33] consider the accuracy in answering 
aggregate queries to be a measure of utility. 

The works in [10, 24, 1] assume that the data is utilized for
classification and hence define the utility accordingly. The
anonymization mechanisms in [17, 2, 32] work by suppressing
data entries in the table, and hence information loss is
measured by the number of suppressed entries.

3. PROBLEM DEFINITION 

Suppose tables $T_1, T_2, \ldots, T_k$ are generated at time
points $1, 2, \ldots, k$, respectively. Each table $T_i$ has
two kinds of attributes, quasi-identifier attributes and
sensitive attributes. For the sake of illustration, we consider
one single sensitive attribute $S$ containing $|S|$ values,
namely $s_1, s_2, \ldots, s_{|S|}$. Assume that the sensitive
values for individuals can freely change from one release to
another, so that the linkage of an individual $o$ to a
sensitive value $s$ in one data release has no effect on the
linkage of $o$ to any other sensitive value in any other data
release. Assume that at each time point $j$, a data publisher
generates an anonymized version $T_j^*$ of $T_j$ for data
publishing, so that each record in $T_j$ belongs to one
anonymized group $G$ in $T_j^*$. Given an anonymized group $G$,
we define $G.S$ to be a multi-set containing all sensitive
values in $G$, and $G.I$ to be the set of individuals that
appear in $G$.

Definition 1 (Possible World). A series of tables
$TS = \{T_1^p, T_2^p, \ldots, T_k^p\}$ is a possible world for
published tables $\{T_1^*, T_2^*, \ldots, T_k^*\}$ if the
following requirement is satisfied. For each $i \in [1, k]$,

1. there is a one-to-one correspondence between individuals
in $T_i^p$ and individuals in $T_i^*$

2. for each anonymized group $G$ in $T_i^*$, the multi-set of
the sensitive values of the corresponding individuals in
$T_i^p$ is equal to $G.S$.

Let $p(o, s, k)$ be the probability that an individual $o$ is
linked to $s$ in at least one published table among published
tables $T_1^*, T_2^*, \ldots, T_k^*$.

Let $t.S$ stand for the sensitive value of tuple $t$. We say
that $o$ is linked to $s$ in a table $T_i^p$ if for the tuple
$t$ of $o$ in $T_i^p$, $t.S = s$. Following previous works, we
define the probability based on the possible worlds as follows.

Definition 2 (Breach Probability). The breach probability is
given by

\[ p(o, s, k) = \frac{W_{link}(o, s, k)}{W_{total,k}} \tag{1} \]

where $W_{link}(o, s, k)$ is the total number of possible
worlds where $o$ is linked to $s$ in at least one published
table among $T_1^*, T_2^*, \ldots, T_k^*$, and $W_{total,k}$ is
the total number of possible worlds for published tables
$T_1^*, T_2^*, \ldots, T_k^*$.

We will describe how we derive a general formula to calculate
$p(o, s, k)$ in Section 4.

While privacy breach is the most important concern, the utility
of the published data also needs to be preserved. There are
different definitions of utility in the existing literature.
Some commonly adopted utility measurements are described in
Section 2.

In this paper, we are studying the following problem. 

Problem 1. Given a privacy parameter $\ell$ (a positive
integer), a utility measurement, $k - 1$ published tables,
namely $T_1^*, T_2^*, \ldots, T_{k-1}^*$, and one raw table
$T_k$, we want to generate a published table $T_k^*$ from $T_k$
such that the utility is maximized, and for each individual $o$
and each sensitive value $s$,

\[ p(o, s, k) \le 1/\ell \]

Note that the above problem definition follows Principle 2 for
the global guarantee as discussed in Section 2.

3.1 Global versus Localized Guarantee 

Here, we show that protecting individual privacy with
Principle 2 (global guarantee) implies protecting individual
privacy with Principle 1 (localized guarantee). Under
Principle 1, let $q(o, s, j, k)$ be the probability that an
individual $o$ is linked to a sensitive value $s$ in the $j$-th
table. Following the definition of probability adopted in most
previous works [30, 5], we have

\[ q(o, s, j, k) = \frac{L_{link}(o, s, j, k)}{W_{total,k}} \]

where $L_{link}(o, s, j, k)$ is the total number of possible
worlds in which $o$ is linked to $s$ in the $j$-th table and
$W_{total,k}$ is the total number of possible worlds for the
$k$ published tables.

In our running example, $k = 2$ and, from Figure 3, there are
four possible worlds, so $W_{total,k} = 4$. Consider published
table $T_1^*$. There are two possible worlds where $o_2$ is
linked to chlamydia ($s$), namely $w_1$ and $w_2$. Thus,
$L_{link}(o_2, s, 1, k) = 2$ and $q(o_2, s, 1, k) = 1/2$.
Similarly, when $j = 2$, $q(o_2, s, 2, k) = 1/2$.

In general, it is obvious that
$W_{link}(o, s, k) \ge L_{link}(o, s, j, k)$ for any
$j \in [1, k]$. We derive that

\[ p(o, s, k) \ge q(o, s, j, k) \]

Hence we have the following lemma.

Lemma 1. If $p(o, s, k) \le 1/\ell$ (under Principle 2), then
for any $j \in [1, k]$, $q(o, s, j, k) \le 1/\ell$ (under
Principle 1).

Corollary 1. Principle 2 (global guarantee) is a strictly
stronger requirement than Principle 1 (localized guarantee).
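
As a small illustration of Lemma 1 on the running example, the
four possible worlds can be written out by hand and both
probabilities counted directly. The encoding below follows the
text ($w_1$ and $w_2$ link $o_2$ to chlamydia in the first
table, $w_1$ and $w_3$ in the second); using flu as the other
value is an assumption.

```python
from fractions import Fraction

CHL, FLU = "chlamydia", "flu"

# Value assigned to o2 in (T1*, T2*) for each possible world w1..w4.
worlds = {
    "w1": (CHL, CHL),
    "w2": (CHL, FLU),
    "w3": (FLU, CHL),
    "w4": (FLU, FLU),
}


def localized(j):
    """q(o2, s, j, k): share of worlds linking o2 to chlamydia in table j."""
    linked = sum(1 for w in worlds.values() if w[j - 1] == CHL)
    return Fraction(linked, len(worlds))


def global_breach():
    """p(o2, s, k): share of worlds linking o2 to chlamydia in any table."""
    linked = sum(1 for w in worlds.values() if CHL in w)
    return Fraction(linked, len(worlds))


print(localized(1), localized(2), global_breach())  # 1/2 1/2 3/4
```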

4. BREACH PROBABILITY ANALYSIS 

In this section, we consider how the breach probability
$p(o, s, k)$ can be derived. For privacy breach, we focus on
the possible assignment of sensitive values to one individual
at a time. Therefore, we introduce the following possible world
definition to deal with assignments to a particular individual.

Definition 3 ($AG_i$). At any data release, let $AG_i(o)$ be
the anonymized group that contains the record for individual
$o$ in published table $T_i^*$.

For the sake of clarity, if the context is clear, we omit the
subscript and denote $AG_i(o)$ by $AG(o)$.

Definition 4 (Possible World for $o$). Consider a possible
world $TS = \{T_1^p, T_2^p, \ldots, T_k^p\}$ for
$\{T_1^*, T_2^*, \ldots, T_k^*\}$. Let us extract the tuples in
each $T_i^p$ that correspond to the tuples in the anonymized
group $AG_i(o)$ (containing individual $o$ in $T_i^*$) to form
table $T_i^p(o)$. Then, the series of smaller tables, denoted
by $TS(o) = \{T_1^p(o), T_2^p(o), \ldots, T_k^p(o)\}$, forms a
possible world for $AG_1(o), \ldots, AG_k(o)$. We also say that
$TS(o)$ is a possible world for $o$ for
$\{T_1^*, T_2^*, \ldots, T_k^*\}$.

For example, Figure 3 shows all the possible worlds for $G_1$
and $G_2$ for $o_2$ in the published tables shown in
Figure 2(a) and Figure 2(b). Note that in the above definition,
if $o$ does not appear in a table $T_i$, then $T_i^p(o)$ is an
empty table.

4.1 Possible World Analysis 

Since the sensitive values are transient and we do not assume
any additional knowledge about the data linkage, the assignment
of sensitive values to individuals in groups other than $AG(o)$
is independent of the assignment to the individuals in $AG(o)$.
Hence, we arrive at the following lemma.

Lemma 2. The value of $p(o, s, k)$ can be derived based on the
analysis of the possible worlds for $o$.

The above lemma greatly simplifies the analysis of the privacy
breach by considering only $AG(o)$ in each data release. In the
following, we may refer to a possible world for $o$ simply as a
possible world.

Consider an anonymized group $AG(o)$ in $T_j^*$ for individual
$o$. Let $n_j$ be the size of $AG(o)$. Let $n_{j,i}$ be the
total number of tuples in $AG(o)$ with sensitive value $s_i$
for $i = 1, 2, \ldots, |S|$. The total number of possible
worlds for $AG(o)$ can be derived by combinatorial analysis.

Lemma 3 (No. of Poss. Worlds for Single Table). The total
number of possible worlds for the anonymized group $AG(o)$ in a
single published table $T_j^*$ is equal to

\[ W_j = \frac{n_j!}{\prod_{i=1}^{|S|} n_{j,i}!} \]

For example, consider an anonymized group of size 4 containing
two $s_1$ values, one $s_2$ value and one $s_3$ value in
$T_j^*$. Then, $W_j$ is equal to
$\frac{4!}{2! \times 1! \times 1!} = 12$.
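
The count in Lemma 3 is a multinomial coefficient. A minimal
sketch (the function name is hypothetical) computes it from the
multi-set of sensitive values and cross-checks the example by
listing the distinct arrangements:

```python
from collections import Counter
from itertools import permutations
from math import factorial


def num_possible_worlds(sensitive_values):
    """n! divided by the product of the factorials of the value counts."""
    total = factorial(len(sensitive_values))
    for count in Counter(sensitive_values).values():
        total //= factorial(count)
    return total


group = ["s1", "s1", "s2", "s3"]
print(num_possible_worlds(group))     # 12
print(len(set(permutations(group))))  # 12, brute-force cross-check
```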



4.2 Breach Probability 

Recall that our objective is to compute $p(o, s, k)$, which
involves two major components, namely $W_{link}(o, s, k)$ and
$W_{total,k}$. In the following, we describe how we obtain the
values of these two components.

By Lemma 3, the total number of possible worlds for $o$ in the
published tables $T_1^*, T_2^*, \ldots, T_k^*$, denoted by
$W_{total,k}$, is equal to

\[ W_{total,k} = \prod_{j=1}^{k} W_j
   = \prod_{j=1}^{k} \frac{n_j!}{\prod_{i=1}^{|S|} n_{j,i}!} \tag{2} \]

Next, we describe how to obtain the formula for
$W_{link}(o, s, k)$. Without loss of generality, we consider
the privacy protection for an arbitrary sensitive value
$s = s_1$. The following analysis applies to each sensitive
value.

Note that, for any sensitive value $s_1$, we have

\[ W_{total,k} = W_{link}(o, s_1, k) + \overline{W}_{link}(o, s_1, k) \]

where $\overline{W}_{link}(o, s_1, k)$ is the total number of
possible worlds in which $o$ is not linked to $s_1$ in any of
the $k$ published tables $T_1^*, T_2^*, \ldots, T_k^*$. Thus,

\[ W_{link}(o, s_1, k) = W_{total,k} - \overline{W}_{link}(o, s_1, k) \]

Next, we show how we derive $\overline{W}_{link}(o, s_1, k)$.
Let $\theta(o, s_1, j)$ be the total number of possible worlds
for table $T_j^*$ (treated as a singleton table series) in
which $o$ is not linked to $s_1$.

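Consider a possible table $T_j^p$ in which $o$ is not linked to
$s_1$. Since $o$ is not linked to $s_1$ in $T_j^p$, $o$ is
linked to a sensitive value $s_q$ with $q \neq 1$ in $T_j^p$.
The number of possible worlds for $T_j^*$ in which $o$ is
linked to $s_q$ is equal to

\[ W_{s_q,j} = \frac{(n_j - 1)!}
   {(n_{j,q} - 1)! \prod_{i \in [1,|S|],\, i \neq q} n_{j,i}!} \]

By considering all sensitive values $s_q$ where
$q \in [2, |S|]$, the total number of possible worlds for
$T_j^*$ in which $o$ is not linked to $s_1$ (i.e.,
$\theta(o, s_1, j)$) is equal to

\[ \theta(o, s_1, j)
   = \sum_{q=2}^{|S|} \frac{(n_j - 1)!}
     {(n_{j,q} - 1)! \prod_{i \neq q} n_{j,i}!}
   = \sum_{q=2}^{|S|} \frac{(n_j - 1)! \, n_{j,q}}
     {\prod_{i=1}^{|S|} n_{j,i}!}
   = \frac{(n_j - 1)!}{\prod_{i=1}^{|S|} n_{j,i}!}
     \sum_{q=2}^{|S|} n_{j,q} \]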

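Consider $\overline{W}_{link}(o, s_1, k)$, the number of
possible worlds in which $o$ is never linked to $s_1$.
Multiplying the per-table counts over the $k$ releases gives

\[ \overline{W}_{link}(o, s_1, k)
   = \prod_{j=1}^{k} \left( \frac{(n_j - 1)!}
     {\prod_{i=1}^{|S|} n_{j,i}!} \sum_{q=2}^{|S|} n_{j,q} \right) \]

From Equation (1),

\[ p(o, s_1, k)
   = \frac{W_{link}(o, s_1, k)}{W_{total,k}}
   = \frac{W_{total,k} - \overline{W}_{link}(o, s_1, k)}{W_{total,k}}
   = \frac{\prod_{j=1}^{k} n_j - \prod_{j=1}^{k} (n_j - n_{j,1})}
     {\prod_{j=1}^{k} n_j} \]

where the last step uses Equation (2) and
$\sum_{q=2}^{|S|} n_{j,q} = n_j - n_{j,1}$.

Lemma 4 (Closed Form of $p(o, s_1, k)$).

\[ p(o, s_1, k)
   = \frac{\prod_{j=1}^{k} n_j - \prod_{j=1}^{k} (n_j - n_{j,1})}
     {\prod_{j=1}^{k} n_j} \tag{3} \]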



In Equation (1), $p(o, s_1, k)$ is defined conceptually in
terms of the total number of possible worlds. Lemma 4 gives a
closed form of $p(o, s_1, k)$. Given $n_j$ (i.e., the size of
the anonymized group in the $j$-th table) and $n_{j,1}$ (i.e.,
the number of tuples in the anonymized group with sensitive
value $s_1$ in the $j$-th table), we can calculate
$p(o, s_1, k)$ directly with its closed form.
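
To make the closed form concrete, the sketch below evaluates it
from the pairs $(n_j, n_{j,1})$ and compares it with a direct
possible world count on a small instance that has repeated
sensitive values. The function names and the test groups are
illustrative assumptions, not part of the paper.

```python
from fractions import Fraction
from itertools import permutations, product


def closed_form(groups):
    """groups: one (n_j, n_j1) pair per release; returns p(o, s1, k)."""
    never_linked = Fraction(1)
    for n_j, n_j1 in groups:
        never_linked *= Fraction(n_j - n_j1, n_j)
    return 1 - never_linked


def brute_force(value_lists, value="s1"):
    """Count possible worlds directly; o is the first member of each group."""
    per_release = [set(permutations(values)) for values in value_lists]
    worlds = list(product(*per_release))
    linked = sum(any(a[0] == value for a in world) for world in worlds)
    return Fraction(linked, len(worlds))


# Hypothetical groups for o in two releases, with duplicate values.
releases = [("s1", "s1", "s2"), ("s1", "s2", "s3", "s3")]
sizes = [(len(v), v.count("s1")) for v in releases]
print(closed_form(sizes), brute_force(releases))  # 3/4 3/4
```

Working with `Fraction` keeps the comparison against the
threshold $1/\ell$ exact, which matters when the probability
sits right at the boundary.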

Example 1 (Two-Table Illustration). Consider that we want to
protect the linkage between an individual $o$ and a sensitive
value $s_1$. Suppose $o$ appears in both published tables
$T_1^*$ and $T_2^*$. Let $AG_1(o)$ and $AG_2(o)$ be the
anonymized groups in $T_1^*$ and $T_2^*$ containing $o$.
Suppose both $AG_1(o)$ and $AG_2(o)$ are linked to $s_1$.

By the notation adopted in this paper, $n_k$ is the size of
$AG_k(o)$ and $n_{k,1}$ is the total number of tuples in
$AG_k(o)$ with sensitive value $s_1$.

By Lemma 4, we have

\[ p(o, s_1, 2)
   = \frac{n_1 n_2 - (n_1 - n_{1,1})(n_2 - n_{2,1})}{n_1 n_2}
   = \frac{n_{2,1} n_1 + n_{1,1} n_2 - n_{1,1} n_{2,1}}{n_1 n_2} \]

□



Example 2 (Running Example). In our running example as shown in
Figure 2, consider the second individual $o_2$ and the
sensitive value "chlamydia". We know that $n_1 = n_2 = 2$.
Suppose $s_1$ is "chlamydia". Thus, $n_{1,1} = n_{2,1} = 1$.
With respect to the published tables shown in Figure 2,
according to the formula derived in Example 1,

\[ p(o_2, s_1, 2)
   = \frac{1 \times 2 + 1 \times 2 - 1 \times 1}{2 \times 2}
   = \frac{3}{4} \]

which is greater than 1/2 (the desired threshold).

However, if we publish the tables shown in Figure 4, then
$n_1 = n_2 = 4$ and $n_{1,1} = n_{2,1} = 1$, so

\[ p(o_2, s_1, 2)
   = \frac{1 \times 4 + 1 \times 4 - 1 \times 1}{4 \times 4}
   = \frac{7}{16} \]

which is smaller than 1/2.

In this example, we observe that, since the published tables
shown in Figure 4 have a larger anonymized group size than the
published tables shown in Figure 2, $p(o_2, s_1, 2)$ is
smaller. □

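In this paper, we aim to publish a table $T_k^*$ like those in
Figure 4 at each time point $k$ such that
$p(o, s, k) \le 1/\ell$ for each individual $o$ and each
sensitive value $s$. By Lemma 4, for $s = s_1$ this requirement
reads

\[ \frac{\prod_{j=1}^{k} n_j - \prod_{j=1}^{k} (n_j - n_{j,1})}
   {\prod_{j=1}^{k} n_j} \le \frac{1}{\ell} \]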

From Example 2, we observe that a larger anonymized group size
reduces the breach probability that individual $o$ has been
linked to sensitive value $s_1$ in the past. However, a larger
anonymized group size alone does not necessarily reduce the
breach probability. Consider an anonymized group in published
table $T_k^*$ in which every tuple has sensitive value $s_1$,
instead of distinct sensitive values. However large this
anonymized group is, it is easy to verify that an individual
$o$ in this group must be linked to $s_1$ in table $T_k^*$.

In fact, the breach probability is determined by the anonymized
group size ratio. The anonymized group size ratio is equal to
the anonymized group size divided by the total number of tuples
in this anonymized group with sensitive value $s_1$. In
Example 2, since all sensitive values are distinct in an
anonymized group (i.e., the total number of tuples in this
anonymized group with sensitive value $s_1$ is equal to 1), the
anonymized group size ratio is equal to the anonymized group
size. In the next section, we will show that a larger
anonymized group size ratio reduces the breach probability.
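
The role of the ratio can be illustrated by rewriting the
closed form of Lemma 4 purely in terms of the ratios
$r_j = n_j / n_{j,1}$, since each release contributes the
factor $1 - 1/r_j$. This is a sketch under that rewriting; the
function name and sample groups are assumptions.

```python
from fractions import Fraction


def breach_from_ratios(ratios):
    """p(o, s1, k) as a function of the size ratios r_j = n_j / n_j1."""
    never_linked = Fraction(1)
    for r in ratios:
        never_linked *= 1 - 1 / Fraction(r)
    return 1 - never_linked


small_group = breach_from_ratios([Fraction(4, 1)])   # size 4, one s1 tuple
large_group = breach_from_ratios([Fraction(8, 2)])   # size 8, two s1 tuples
all_s1 = breach_from_ratios([Fraction(8, 8)])        # size 8, all tuples s1
print(small_group, large_group, all_s1)  # 1/4 1/4 1
```

Doubling the group size while keeping the same proportion of
$s_1$ tuples leaves the probability unchanged, and a group made
up entirely of $s_1$ tuples gives probability 1 regardless of
its size.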



5. THEORETICAL PROPERTIES 

In the previous section, we observed that a larger anonymized
group size ratio can reduce the breach probability. In this
section, we first study some properties of our problem,
including a minimum anonymized group size ratio for the global
privacy guarantee, and then a monotonicity property that can be
useful in data anonymization.

5.1 Minimum AG size Ratio 

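Recall that $n_k$ is the anonymized group (AG) size and
$n_{k,1}$ is the number of tuples in the anonymized group with
sensitive value $s_1$. In the following, we derive the minimum
anonymized group size ratio $n_k / n_{k,1}$ for privacy
protection under the global guarantee.

Theorem 1. Let $k$ be an integer greater than 1. Suppose the
anonymized group in $T_k^*$ containing individual $o$ is linked
to $s_1$. Then $p(o, s_1, k) \le 1/\ell$ if and only if

\[ \frac{n_k}{n_{k,1}} \ge
   \frac{\ell \prod_{j=1}^{k-1} (n_j - n_{j,1})}
   {\ell \prod_{j=1}^{k-1} (n_j - n_{j,1})
    - (\ell - 1) \prod_{j=1}^{k-1} n_j} \tag{4} \]

Proof: By Lemma 4, $p(o, s_1, k)$ is equal to

\[ \frac{\prod_{j=1}^{k} n_j - \prod_{j=1}^{k} (n_j - n_{j,1})}
   {\prod_{j=1}^{k} n_j}
   = \frac{n_k \prod_{j=1}^{k-1} n_j
     - (n_k - n_{k,1}) \prod_{j=1}^{k-1} (n_j - n_{j,1})}
   {n_k \prod_{j=1}^{k-1} n_j} \]

Requiring this quantity to be at most $1/\ell$ and solving for
$n_k / n_{k,1}$ gives Inequality (4), provided that the
denominator of the right-hand side is positive, that is,
$p(o, s_1, k-1) < 1/\ell$. □

From the above, for any $k > 1$, we can see that the value of
$n_k / n_{k,1}$ should be lower bounded by the right-hand side
of Inequality (4), which we denote by $n(k)$. We define
$n(k) = \ell$ when $k = 1$.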

Example 3 (Running Example). From Example 2, we know that the
published tables shown in Figure 4 satisfy the privacy
requirement (i.e., $p(o, s, k) \le 1/\ell$ where $k = 2$ and
$\ell = 2$). At time $k = 3$, we want to publish a new table
$T_3^*$ from a raw table $T_3$ which contains $o_2$.

Suppose we put $o_2$ in the anonymized group $AG_3(o_2)$ in
$T_3^*$, which is linked to $s_1$, where $s_1$ is chlamydia. By
Theorem 1, when $k = 3$, the R.H.S. of Inequality (4) becomes

\[ \frac{\ell (n_1 - n_{1,1})(n_2 - n_{2,1})}
   {\ell (n_1 - n_{1,1})(n_2 - n_{2,1}) - (\ell - 1) n_1 n_2}
   = \frac{2(4 - 1)(4 - 1)}{2(4 - 1)(4 - 1) - (2 - 1) \cdot 4 \cdot 4}
   = 9 \]

which is the minimum anonymized group size ratio
$n_3 / n_{3,1}$ in the published table $T_3^*$. Suppose
$AG_3(o_2)$ contains only one occurrence of $s_1$. Then, the
size of the anonymized group $AG_3(o_2)$ should be at least 9
so that $p(o_2, s_1, 3) \le 1/2$. □
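
The bound of Theorem 1 is easy to evaluate from the history of
earlier releases. The sketch below is one possible way to do
so; the function name `min_ratio` and the choice to return
`None` when no finite ratio can satisfy the threshold are
assumptions made for illustration.

```python
from fractions import Fraction


def min_ratio(previous, ell):
    """n(k) from Theorem 1.

    `previous` lists (n_j, n_j1) for releases 1..k-1 in which the group of
    o is linked to s1. Returns None when the denominator is not positive,
    i.e. when no finite ratio keeps p(o, s1, k) within 1/ell.
    """
    if not previous:
        return Fraction(ell)  # n(1) = ell by definition
    not_linked, sizes = 1, 1
    for n_j, n_j1 in previous:
        not_linked *= n_j - n_j1
        sizes *= n_j
    denominator = ell * not_linked - (ell - 1) * sizes
    if denominator <= 0:
        return None
    return Fraction(ell * not_linked, denominator)


print(min_ratio([(4, 1), (4, 1)], 2))  # 9
print(min_ratio([], 2))                # 2
print(min_ratio([(2, 1), (2, 1)], 2))  # None
```

The last call uses the size-2 groups of Figure 2, for which the
threshold is already exceeded after two releases.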



We have the following corollary when the inequality in
Theorem 1 becomes an equality.

Corollary 2. $n_k / n_{k,1} = n(k)$ if and only if
$p(o, s_1, k) = 1/\ell$.

When a record for individual $o$ appears in a data release
$T_i$ and, in the published data $T_i^*$, the anonymized group
containing $o$ has no relation to sensitive value $s$, then
intuitively this release should not have any impact on the
privacy protection of $o$ linking to $s$. This is formally
stated in the following lemma.

Lemma 5. If the anonymized group in $T_k^*$ containing $o$ is
not linked to $s_1$, then $p(o, s_1, k) = p(o, s_1, k - 1)$.
Proof: Since the anonymized group in $T_k^*$ containing $o$ is
not linked to $s_1$, in any possible world where $o$ is linked
to $s_1$, the linkage occurs in one of the first $k - 1$
published tables. Thus,

\[ W_{link}(o, s_1, k) = W_k \times W_{link}(o, s_1, k - 1) \]

Thus, we have

\[ \begin{aligned}
   p(o, s_1, k) &= \frac{W_{link}(o, s_1, k)}{W_{total,k}} \\
   &= \frac{W_k \times W_{link}(o, s_1, k - 1)}{\prod_{j=1}^{k} W_j}
      \quad \text{(from Equation (2))} \\
   &= \frac{W_{link}(o, s_1, k - 1)}{W_{total,k-1}} \\
   &= p(o, s_1, k - 1)
   \end{aligned} \]

□

Thus, $n_k$ can take any value and does not affect the value of
$p(o, s_1, k)$ in this case.

Suppose a published table $T_k^*$ contains $o$ and we need to
generate an anonymized group $G$ containing $o$. Note that the
size of the anonymized group $G$ is $n_k$ and the number of
tuples in $G$ with sensitive value $s_i$ is equal to $n_{k,i}$
for $i \in [1, |S|]$. Without loss of generality, suppose we
want to protect the linkage between an individual $o$ and a
sensitive value $s_1$. From Theorem 1 and Lemma 5, we can
determine the minimum value of $n_k / n_{k,1}$ for generating
an anonymized group $G$. From Theorem 1, if $G$ contains $s_1$,
in order to guarantee $p(o, s_1, k) \le 1/\ell$, we have to set
the value of $n_k$ to satisfy

\[ \frac{n_k}{n_{k,1}} \ge n(k) \]

From Lemma 5, if $G$ does not contain $s_1$, the value of $n_k$
does not affect the privacy related to $o$ and $s_1$.

Theorem 1 suggests that if we set the value of
$n_k / n_{k,1}$ to at least $n(k)$, then
$p(o, s_1, k) \le 1/\ell$. However, suppose we set this value
exactly equal to $n(k)$. Although we can guarantee
$p(o, s_1, k) \le 1/\ell$ for the $k$ published tables, there
will be a privacy breach (i.e., $p(o, s_1, k') > 1/\ell$ for
$k' > k$) for any additional future published table in which
the anonymized group containing $o$ is linked to $s_1$. This is
a result of the following theorem.

Theorem 2. Consider that we have published $k - 1$ tables,
where the anonymized group in $T_{k-1}^*$ containing $o$ is
linked to $s_1$. Suppose we are to publish $T_k^*$, where the
anonymized group in $T_k^*$ containing $o$ is also linked to
$s_1$. If $n_{k-1} / n_{k-1,1} = n(k - 1)$, then
$p(o, s_1, k) > 1/\ell$.
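
Theorem 2 can be observed numerically with the closed form of
Lemma 4. In this sketch (threshold $\ell = 2$ and the group
sizes are illustrative assumptions), the first release uses the
tight ratio $n(1) = \ell$, after which every later group linked
to $s_1$ pushes the probability above $1/\ell$, however large
that group is.

```python
from fractions import Fraction

ELL = 2  # assumed privacy parameter, threshold 1/ELL


def p(groups):
    """Closed form of Lemma 4 over (n_j, n_j1) pairs."""
    keep = Fraction(1)
    for n_j, n_j1 in groups:
        keep *= Fraction(n_j - n_j1, n_j)
    return 1 - keep


tight = [(2, 1)]  # ratio 2 = n(1), so p(o, s1, 1) is exactly 1/2
later = [p(tight + [(n2, 1)]) for n2 in range(2, 50)]
print(p(tight), min(later) > Fraction(1, ELL))  # 1/2 True
```

Leaving some slack in earlier releases (a ratio strictly above
$n(k)$) is what keeps later releases feasible.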

5.2 Monotonicity 

Monotonicity is a useful property for anonymization processes in which the resulting anonymized groups are constructed in a bottom-up manner, merging smaller groups that violate the privacy requirement into bigger groups that may guarantee privacy. It is also useful when the anonymization is top-down, splitting bigger groups into smaller ones as long as the privacy guarantee holds.

Consider the privacy protection for the linkage of an individual $o$ to a sensitive value $s_1$. From Lemma 5 we know that $p(o, s_1, k)$ is independent of data releases in which the anonymized group containing $o$ (in the published table) is not linked to $s_1$. Hence, in the following, we consider the worst-case scenario where, in all releases, whenever there exists an anonymized group containing $o$ (in a published table), $o$ is linked to $s_1$.

The monotonicity property is described as follows. 

Theorem 3 (Monotonicity). $p(o, s_1, k)$ is strictly decreasing when $n_k/n_{k,1}$ increases.

The proof is given in the appendix. Note that $n_k/n_{k,1}$ is essentially the inverse of the proportion of $s_1$ tuples in the anonymized group. Therefore, when a bigger group that satisfies the privacy requirement is split into smaller ones, if the proportion of $s_1$ tuples in the small group containing $o$ is not increased, then $p(o, s_1, k)$ is not increased. Conversely, if a small group violates the privacy guarantee, merging it with another group may decrease the proportion of $s_1$ tuples and thus $p(o, s_1, k)$ may be decreased.
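The monotonicity property can be checked numerically with a minimal sketch. It assumes that the linkage probability takes the product form $p = 1 - \prod_j (1 - n_{j,1}/n_j)$ over the releases in which $o$ is linked to $s_1$; this closed form and the function name `linkage_probability` are assumptions of the sketch, not a restatement of Theorem 1.

```python
from math import prod


def linkage_probability(ratios):
    """Return p(o, s1, k) for the size ratios n_j / n_{j,1} of k releases.

    Assumption of this sketch: p = 1 - prod_j (1 - n_{j,1} / n_j),
    where each ratio r_j = n_j / n_{j,1} is greater than or equal to 1.
    """
    return 1.0 - prod(1.0 - 1.0 / r for r in ratios)


# Two earlier releases with ratio 10, then a varying ratio in the k-th release.
earlier = [10.0, 10.0]
probabilities = [linkage_probability(earlier + [r]) for r in (4.0, 8.0, 16.0)]

for ratio, p in zip((4.0, 8.0, 16.0), probabilities):
    print(f"ratio in release k = {ratio:5.1f}  ->  p = {p:.4f}")
```

Under this assumption, a larger ratio in the latest release always yields a smaller probability, which is the behaviour Theorem 3 describes.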

An anonymized group $AG$ is said to violate the global guarantee if there exists an individual $o \in AG.I$ and a sensitive value $s \in AG.S$ such that $p(o, s, k) > 1/\ell$.

Corollary 3. Consider an anonymized group $AG$ in the published table $T^*_k$ which violates the global guarantee. If we partition $AG$ into a number of smaller groups, one of the smaller groups violates the global guarantee.

Proof Sketch: Suppose $n_k/n_{k,i}$ is the size ratio for $AG$. It is easy to see that at least one of the smaller groups has a size ratio no larger than $n_k/n_{k,i}$. By Theorem 3, $p(o, s, k)$ for that group does not decrease. Since $AG$ violates the global guarantee (i.e., $p(o, s, k) > 1/\ell$), the smaller group also violates the global guarantee (i.e., $p(o, s, k) > 1/\ell$). □
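The counting step in the proof sketch can be illustrated with a small exhaustive check. The group contents, the value labels and the helper names below are hypothetical; the sketch only verifies that every two-way split of one toy group leaves at least one part whose size ratio is no larger than that of the whole group.

```python
from itertools import combinations


def size_ratio(group, s):
    """Size ratio n / n_s of a group given as a list of sensitive values."""
    count = group.count(s)
    return float("inf") if count == 0 else len(group) / count


def splits_into_two(group):
    """Yield every partition of the group into two non-empty parts."""
    indices = range(len(group))
    for size in range(1, len(group)):
        for left in combinations(indices, size):
            chosen = set(left)
            yield ([group[i] for i in indices if i in chosen],
                   [group[i] for i in indices if i not in chosen])


# Hypothetical group: 6 tuples, 2 of them carrying the sensitive value "s1".
group = ["s1", "s1", "flu", "cold", "cold", "flu"]
overall = size_ratio(group, "s1")

smallest_part_ratios = [
    min(size_ratio(a, "s1"), size_ratio(b, "s1"))
    for a, b in splits_into_two(group)
]
print(overall, max(smallest_part_ratios))
```

Because the ratio of the whole group lies between the ratios of its parts, the part with the smaller ratio is never better protected than the original group.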

6. ANONYMIZATION 

In previous sections, we have observed that, by choosing a proper size of an anonymized group, the global privacy guarantee can be achieved. In general, a size above a certain threshold size can be chosen. However, setting a size equal to the threshold size will make future anonymization infeasible (see Theorem 2). Therefore, it is necessary to choose a size that is greater than the threshold. The increase in size, however, would lead to a decrease in the utility of the data. Hence, the question is how to pick the smallest size that can maintain the global guarantee.

In this section, we show that if we are given a bound on the 
number of releases where an individual o may be linked to a 
sensitive value s, then we can devise a strategy to minimize 
the maximum anonymization group size. We also propose 
another strategy which aims to reduce the anonymized group 
size on average. 

6.1 Constant-Ratio Strategy

In database related problems, one can typically derive effective mechanisms based on the characteristics of the data itself. In our problem scenario, a data publisher has at his/her disposal the statistical information of the data collections. For example, consider the medical database. The statistics can point to the expected frequency of an individual contracting a certain disease over his or her lifespan. With such information, one can set an estimated bound on the number of data releases in which a person may indeed be linked to the disease. With this knowledge, one can adopt a constant-ratio strategy which, as we show below, minimizes the maximum size of the corresponding anonymized groups.

ℓ      2      2      2       5       10       2       5       10
k'     2      5      20      20      20       10      10      10
n_c    3.44   7.75   29.41   90.13   190.33   15.10   45.35   95.42

Table 1: Values of n_c with selected values of ℓ and k'

The constant-ratio strategy makes sure that the size of the anonymized group $AG(o)$ for individual $o$ containing $s_1$, divided by the number of occurrences of $s_1$, remains unchanged over a number of data releases. Formally, given an integer $k'$ for the number of data releases, for $i \in [1, k']$,

\[ \frac{n_{o_i}}{n_{o_i,1}} = n_c \]

where $n_c$ is a positive real number constant, and $o_i$ is a timestamp for the $i$-th release where both $o$ and $s_1$ appear. For the sake of simplicity, we set $n_c = n/n_s$ where $n$ and $n_s$ are positive integer constants where $n_s < n$.

$k'$ corresponds to the total number of possible releases in the future. In other words, during data publishing, the data publisher expects to publish $k'$ tables for this data. With this given parameter $k'$, we can calculate $n$ and $n_s$ such that the ratio $n_{o_i}/n_{o_i,1}$ remains unchanged when $i$ changes.

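In order to make sure that $p(o, s_1, j) \le 1/\ell$ for any $j \in [1, k']$, we need to ensure $p(o, s_1, k') \le 1/\ell$. In the following, we consider $p(o, s_1, k')$, which is equal to

\[ \frac{\prod_{j=1}^{k'} n_j - \prod_{j=1}^{k'} (n_j - n_{j,1})}{\prod_{j=1}^{k'} n_j} \le \frac{1}{\ell} \]

With $n_j/n_{j,1} = n/n_s$ for every $j$, this becomes

\[ n^{k'} - (n - n_s)^{k'} \le n^{k'} \times \frac{1}{\ell} \]

\[ \frac{n_s}{n} \le 1 - \left(1 - \frac{1}{\ell}\right)^{1/k'} \]

Let $n_c = 1/[1 - (1 - 1/\ell)^{1/k'}]$.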

Table 1 shows the values of $n_c$ with selected values of $\ell$ and $k'$. When $\ell$ increases, $n_c$ increases. When $k'$ increases, $n_c$ also increases.
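The closed form for $n_c$ can be evaluated directly. The following is a minimal sketch; the function names `constant_ratio` and `linkage_after` are hypothetical, and the second function assumes the product form $p = 1 - (1 - 1/n_c)^{k}$ for $k$ releases published at the constant ratio.

```python
def constant_ratio(l, k_prime):
    """Return n_c = 1 / [1 - (1 - 1/l)^(1/k')] for privacy parameter l."""
    return 1.0 / (1.0 - (1.0 - 1.0 / l) ** (1.0 / k_prime))


def linkage_after(n_c, releases):
    """Assumed linkage probability after `releases` tables at ratio n_c."""
    return 1.0 - (1.0 - 1.0 / n_c) ** releases


n_c = constant_ratio(l=5, k_prime=20)
print(f"n_c = {n_c:.2f}")
print(f"p after 20 releases = {linkage_after(n_c, 20):.4f}")
print(f"p after 21 releases = {linkage_after(n_c, 21):.4f}")
```

For $\ell = 5$ and $k' = 20$ this evaluates to roughly 90.13. Under the stated assumption the bound $1/\ell$ is met exactly at release $k'$ and exceeded one release later, which is why $k'$ has to be a genuine upper bound on the number of releases.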

It remains to show that the constant-ratio strategy indeed 
can lead to data publishing that minimizes the maximum 
anonymized group sizes. First, we define this property more 
formally. 

Definition 5 (Min-Max optimization). An anonymization for serial data publishing is min-max optimal if the maximum anonymized group size among the anonymized groups containing individual $o$ and sensitive value $s_1$, for any given $o$ and $s_1$ over all data releases, is minimized.

Theorem 4 (Optimality). The constant-ratio strategy 
generates a min-max optimal solution for serial data publish- 
ing. 

Proof: Let $N$ be the set of anonymized group sizes in the $k'$ published tables where these anonymized groups contain $o$ and are linked to $s_1$. That is, $N = \{n_1, n_2, \ldots, n_{k'}\}$. Let $u(N) = \max_{n_i \in N} n_i$. Let $N_a$ be the set of anonymized group sizes in the $k'$ published tables generated by strategy $a$.

Let $p(o, s_1, k' | a)$ be $p(o, s_1, k')$ with respect to strategy $a$. Let $A$ be the set of all possible strategies $a$ such that, with the published tables with strategy $a$, $p(o, s_1, k' | a) \le 1/\ell$. Suppose $a_o$ is the constant-ratio strategy. We will prove that this strategy can obtain an optimal value of $u(N)$. That is,

\[ u(N_{a_o}) = \min_{a \in A} \{u(N_a)\} \]

We prove by contradiction. Consider that the strategy $a_o$ generates $N_{a_o} = \{n_1, n_2, \ldots, n_{k'}\}$. By Corollary 2, it is easy to verify that $p(o, s_1, k' | a_o) = 1/\ell$.

Suppose there exists a strategy $a' \ne a_o$ which generates $N_{a'} = \{n'_1, n'_2, \ldots, n'_{k'}\}$ such that $u(N_{a'}) < u(N_{a_o})$ and $p(o, s_1, k' | a') \le 1/\ell$. We deduce that, for all $i \in [1, k']$,

\[ n'_i < n_i \]

By Theorem 2, we know that $p(o, s_1, k' | a') > p(o, s_1, k' | a_o)$ (which is equal to $1/\ell$). We conclude that $p(o, s_1, k' | a') > 1/\ell$. Thus, a privacy breach occurs, which leads to a contradiction.

□

Although the constant-ratio strategy generates a min-max optimal solution, it requires statistical information about the data to be known. For example, the constant-ratio strategy requires prior knowledge of $k'$, which is equal to the total number of possible releases in the future. If such information is unavailable, we can use the geometric strategy proposed in the next subsection, which does not require this statistical information.

6.2 Geometric Strategy 

Other than minimizing the maximum anonymized group 
size, another desirable utility criterion will be to minimize 
the average group size. In order to achieve this goal, we 
examine the probability of occurrences of anonymized groups 
for linking individuals o to a certain sensitive value s. From 
past data, there will be a distribution of the total number 
of releases where any given individual has contracted disease 
s. For example, if the maximum of such value is 10, some 
individuals may be linked to s 10 times in total, but most 
individuals may be linked to s less than 10 times in total. 
Typically, the number of individuals that are linked to s for 
at least k releases will be greater than that for k" releases 
where k < k" . Therefore, when choosing the sizes of the 
anonymized groups, it will reduce the average group size if we 
choose smaller sizes for the earlier releases where o is linked 
to s and bigger sizes for the later such releases. This is the 
essence of our next proposed strategy, namely, the geometric 
strategy. 

With the geometric strategy, the size ratio of the anonymized group will be equal to the minimum feasible value $n(k)$ multiplied by a factor, $a$, at any time point $k$. This will be a growing value since the value of $n(k)$ will grow with $k$. Note that $a$ must be greater than 1 since, with $a = 1$, the minimum feasible ratio will be used, and from Theorem 2 that will make future selection of group size infeasible. The value of $a$ can be selected based on the estimated number of releases where an individual will be linked to $s$ in total.
Figure 5: Effect of $a$ and $\ell$ on ratio $n_k/n_{k,1}$

Thus, with this strategy, with $j \ge 1$, we set

\[ \frac{n_j}{n_{j,1}} = a \cdot n(j) \]

Figure 5 shows how the value of $n_k/n_{k,1}$ increases with $k$. Figure 5(a) studies the effect of $a$ (with $\ell$ set to 2) and Figure 5(b) studies the effect of $\ell$ (with $a$ set to 5). When $k$ increases, $n_k/n_{k,1}$ increases. When $a$ is larger, although the initial value of $n_k/n_{k,1}$ is larger, the growth rate of $n_k/n_{k,1}$ is smaller. When $\ell$ increases, $n_k/n_{k,1}$ increases. Figure 5(a) shows that $a = 5$ and $a = 10$ are better choices than $a = 3$ since the increase in the ratio is much slower. Note that these values are all pre-computable and it is easy to choose a suitable parameter by examining the pre-computed trends.
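The pre-computation mentioned above can be sketched as follows. Since the definition of $n(k)$ is not restated in this section, the sketch derives a minimum feasible ratio from an assumed product form $p = 1 - \prod_j (1 - 1/r_j)$ with $r_j = n_j/n_{j,1}$; the function names and this derivation are assumptions for illustration, not the paper's stated formula.

```python
import math


def linkage_probability(ratios):
    """Assumed form: p = 1 - prod_j (1 - 1/r_j) over the given size ratios."""
    return 1.0 - math.prod(1.0 - 1.0 / r for r in ratios)


def min_feasible_ratio(l, previous_ratios):
    """Smallest ratio for the next release that keeps p <= 1/l (assumed form)."""
    survive = math.prod(1.0 - 1.0 / r for r in previous_ratios)
    budget = 1.0 - 1.0 / l
    if survive <= budget:
        return math.inf  # no finite group size can restore the guarantee
    return 1.0 / (1.0 - budget / survive)


def geometric_ratios(l, a, rounds):
    """Ratios chosen as a times the minimum feasible ratio in every round."""
    ratios = []
    for _ in range(rounds):
        ratios.append(a * min_feasible_ratio(l, ratios))
    return ratios


ratios = geometric_ratios(l=2, a=5.0, rounds=6)
for k, r in enumerate(ratios, start=1):
    print(f"k = {k}: ratio = {r:.2f}, p = {linkage_probability(ratios[:k]):.4f}")
```

Under this assumption, choosing the ratio exactly at the minimum (that is, $a = 1$) exhausts the privacy budget and leaves no finite ratio for the next release, while any $a > 1$ keeps every later release feasible at the cost of steadily growing ratios.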

6.3 Discussion 

In both of the above strategies, there may be rare occasions where the required anonymized group size is not available in the given data set. As with previous works [30, 5], we handle the exceptional cases by data distortion. We can suppress the sensitive values of individuals when it is found that no feasible group size can maintain the global guarantee for privacy preservation. In our experiments, such suppression was never needed.

Though our discussion has been based on a single value for 
the sensitive attribute in each record, our results can be easily 
extended to the case where each record may contain a set of 
values for the sensitive attribute. The essential proportion of 
possible worlds where an individual is linked to a sensitive 
value would not be affected. 

7. IMPLEMENTATION 

In Section 6 we describe two strategies to determine the value of $n_k/n_{k,1}$ for privacy protection with respect to a sensitive value $s_1$. In the following, we describe how to anonymize the table given the desired value of $n_k/n_{k,1}$.

Since the formula is based on the frequency with which a tuple for individual $o$ is linked to a sensitive value $s$ in an anonymized group from published tables (by Theorem 1 and Lemma 5), we propose to keep a data structure, called the statistics file, to store the sizes of the anonymized groups containing a record for individual $o$ such that $o$ is linked to a sensitive value $s$, denoted by $m(o, s)$. Consider an individual $o$ and a sensitive value $s$. Let the anonymized groups containing $o$ in $T^*_1$, $T^*_2$ and $T^*_3$ be $G_1$ (of size 3), $G_2$ (of size 5) and $G_3$ (of size 4), respectively. If $G_1$ and $G_2$ contain $s$ but $G_3$ does not, $m(o, s)$ is equal to $\{3, 5\}$. Suppose there is another published table $T^*_4$ which does not contain $o$. Then $m(o, s)$ is still equal to $\{3, 5\}$.
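One possible in-memory layout for the statistics file is sketched below. The class name `StatisticsFile`, its method names and the dictionary-based storage are hypothetical choices for illustration; the text only specifies what $m(o, s)$ must contain. The usage lines replay the example above.

```python
from collections import defaultdict


class StatisticsFile:
    """Stores, per (individual, sensitive value), the sizes of linking groups."""

    def __init__(self):
        self._sizes = defaultdict(list)

    def record_group(self, individual, group_size, sensitive_values):
        """Register the published group that contains `individual` in one release."""
        for value in set(sensitive_values):
            self._sizes[(individual, value)].append(group_size)

    def m(self, individual, value):
        """Return the recorded group sizes for the pair, oldest release first."""
        return list(self._sizes.get((individual, value), []))


stats = StatisticsFile()
stats.record_group("o", 3, ["s", "flu", "cold"])                   # G1 contains s
stats.record_group("o", 5, ["s", "flu", "flu", "cold", "cold"])    # G2 contains s
stats.record_group("o", 4, ["flu", "flu", "cold", "cold"])         # G3 does not
# The fourth table does not contain o, so nothing is recorded for that release.
print(stats.m("o", "s"))
```

Keeping the sizes in release order makes it straightforward to recompute the accumulated linkage probability before choosing the next group size.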



Given the statistics file, it is possible to adopt existing known anonymization methods to generate anonymized groups that satisfy the group size ratio requirement of interest. For example, we may use a bottom-up approach to grow the anonymized groups. Alternatively, we can use a top-down approach that keeps breaking up large anonymized groups and stops when a further split would violate the group size ratio requirement.

8. EMPIRICAL STUDIES 

All of our experiments have been performed on a Linux workstation with a 3.2 GHz CPU and 2 GB of memory. Similar to [5], we deploy one publicly available real hospital database, CADRMP*. In the database, there are 8 tables: Reports, Reactions, Drugs, ReportDrug, Ingredients, Outcome, and Druginvolve. Reports consists of some patients' basic personal information. Therefore, we take it as the voter registration list. Reactions has a foreign key PID referring to the attribute ID in Reports and another attribute to indicate the person's disease. After removing tuples with missing values, Reactions has 105,420 tuples while Reports contains 40,478 different individuals. We take the 10% least frequent sensitive values as transient sensitive values; there are 232 transient sensitive values in total.

The dynamic microdata table series $TS_{exp} = \{P_1, P_2, \ldots, P_{20}\}$ is created from Reactions. We divide Reactions into 20 partitions of the same size, namely $P_1, P_2, \ldots, P_{20}$. $T_1$ is set to $P_1$. For each $i \in [2, 20]$, we generate $T_i$ as follows. $T_i$ is set to $P_i$ initially. Then, we randomly select 20% of the tuples in $T_{i-1}$ and insert them into $T_i$. Then, in the resulting $T_i$, we randomly select 20% of the tuples and change their values in the sensitive attribute according to the sensitive value distribution of all tuples in Reactions as follows. For each tuple $t$ selected in the above step, we randomly pick a tuple $t'$ in the original data Reactions and set the sensitive value of $t$ in $T_i$ to be the sensitive value of $t'$ in Reactions.
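The generation procedure described above can be sketched as follows. The tuple layout `(person_id, sensitive_value)`, the function name `build_table_series` and the fixed random seed are assumptions for illustration; the synthetic input merely stands in for the Reactions table.

```python
import random


def build_table_series(reactions, rounds=20, carry=0.2, change=0.2, seed=0):
    """Build T_1..T_rounds from equal-size partitions of `reactions`."""
    rng = random.Random(seed)
    size = len(reactions) // rounds
    partitions = [reactions[i * size:(i + 1) * size] for i in range(rounds)]
    all_values = [value for _, value in reactions]

    tables = [list(partitions[0])]                       # T_1 = P_1
    for i in range(1, rounds):
        previous = tables[-1]
        # Start from P_i and insert a random 20% of the tuples of T_{i-1}.
        table = list(partitions[i]) + rng.sample(previous, int(carry * len(previous)))
        # Re-draw the sensitive value of a random 20% of the tuples of T_i.
        for index in rng.sample(range(len(table)), int(change * len(table))):
            person, _ = table[index]
            table[index] = (person, rng.choice(all_values))
        tables.append(table)
    return tables


# Synthetic stand-in for Reactions: 200 tuples over 5 hypothetical diseases.
reactions = [(pid, f"disease_{pid % 5}") for pid in range(200)]
tables = build_table_series(reactions)
print([len(t) for t in tables[:3]])
```

Drawing the replacement value from a uniformly chosen tuple of the full table reproduces the overall sensitive value distribution, as the description requires.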

For our experiments, we have chosen a bottom-up anonymization algorithm [25], with a variation that involves the individuals that are present in the voter registration list but absent in the data release. Such individuals can be virtually included in an anonymized group and help to dilute the linkage probability of individuals to sensitive values in the group [8, 5]. This variation helps to improve the utility since a group can now consist of fewer records that are actually present in the data. A bottom-up approach is chosen because we find that typically the anonymized groups can be easily formed based on the smallest group sizes that satisfy the required group size ratios. In the constant-ratio strategy, the default value of $k'$ is equal to 20. In the geometric strategy, the default value of $a$ is set to 2.

We have tested our proposed method in terms of effec- 
tiveness and efficiency. For the evaluation of our method, 
we examine four different aspects: the average size of the 
anonymized groups in the published tables, the greatest size 
of the anonymized groups in the published tables, the utility 
of the published tables and the computation overheads. 

*http://www.hc-sc.gc.ca/dhp-mps/medeff/databasdon/index_g.html

For measuring the utility of the published data, we compare query processing results on each anonymized table $T^*_j$ and its corresponding microdata table $T_j$ at each publishing round. We follow the literature conventions [28, 30, 26, 5] and measure the error by the relative error ratio in answering an aggregate query. All the published tables are evaluated one by one. For each evaluation, we perform 5,000 randomly generated range queries, following the methodology in [30], on the microdata snapshot and its anonymized version, and then report the average relative error ratio.
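The error measure can be sketched as follows. The function names are hypothetical, and skipping queries whose true answer is zero is an assumption of this sketch, since the text does not say how such queries are treated.

```python
def relative_error(actual, estimate):
    """Relative error ratio |actual - estimate| / actual of one aggregate query."""
    return abs(actual - estimate) / actual


def average_relative_error(answers):
    """Average the relative error over (actual, estimate) pairs.

    Queries whose actual answer is zero are skipped (assumption of this sketch).
    """
    errors = [relative_error(act, est) for act, est in answers if act > 0]
    return sum(errors) / len(errors)


# Hypothetical answers: (count on microdata, count estimated from anonymized table).
answers = [(10, 8), (4, 5), (0, 3)]
print(average_relative_error(answers))
```

In the experiments each pair would come from evaluating one range query on the microdata snapshot and on its anonymized version.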

We study the effect of variations in (1) the number of rounds, (2) the privacy requirement $\ell$, (3) the parameter $k'$ used in the constant-ratio strategy and (4) the parameter $a$ used in the geometric strategy.

Effect of $\ell$: Figure 6(a) shows that the average relative error of the constant-ratio strategy remains nearly unchanged when the number of rounds (or table releases) increases. As expected, the error is larger with larger values of $\ell$. In Figures 6(b) and (c), both the average anonymized group size and the maximum anonymized group size of the constant-ratio strategy remain nearly unchanged when we vary the number of rounds. Again as expected, the sizes increase with $\ell$. Figure 6(d) shows that the execution time of the constant-ratio strategy remains unchanged when there are more rounds. In the figure, when $\ell$ is larger, the execution time is larger. This is because we have to generate larger anonymized groups.

Figure 7 shows similar results for the geometric strategy with variation in the number of rounds. Figures 7(a), (b) and (c) show that the error, the average anonymized group size and the maximum anonymized group size remain nearly unchanged when the number of rounds increases. In Figure 7(d), we cannot see a consistent trend when we vary $\ell$. The execution time when $\ell = 3$ is the smallest. However, the execution time when $\ell = 7$ is smaller than that when $\ell = 5$. The execution time of the algorithm depends on two factors, namely the number of anonymized groups in the released tables and the sizes of the anonymized groups. Generating anonymized groups with larger sizes will increase the execution time. On the other hand, generating fewer anonymized groups will reduce the execution time. When $\ell = 7$, since the factor of the total number of anonymized groups (i.e., fewer anonymized groups) outweighs the factor of the size of the anonymized groups (i.e., larger anonymized group size), the execution time is smaller (compared with the case when $\ell = 5$).

Effect of $k'$: We study the input parameter $k'$ used in the constant-ratio strategy. In Figure 8, the average relative error, the average anonymized group size, the maximum group size and the execution time remain nearly unchanged when the number of rounds increases. The average relative error, the average anonymized group size and the maximum group size increase when $k'$ increases, as shown in Figures 8(a), (b) and (c). In Figure 8(d), we cannot observe a consistent trend in the execution time when $k'$ increases. The reason is similar to that given for $\ell$.

Effect of $a$: We also study the input parameter $a$ for the geometric strategy. Similarly, Figure 9 shows that the number of rounds does not have a significant impact on the average relative error, the average anonymized group size, the maximum anonymized group size and the execution time. Figures 9(a), (b) and (c) show that, when $a$ increases, the average relative error, the average anonymized group size and the maximum anonymized group size increase. There is no consistent trend for the execution time when we vary $a$, as shown in Figure 9(d). The reasons are similar to those in the study of the effect of $\ell$.

Overall, our proposed methods are very efficient and intro- 
duce very small querying error. It shows that our method can 
provide the global guarantee on individual privacy as well as 
maintain high utility in the published data. 

9. CONCLUSION 

In this paper, we propose a new criterion of global guarantee 
for privacy preserving data publishing. This guarantee cor- 
responds to a basic requirement of individual privacy where 
the probability of linking an individual to a sensitive value in 
one or more data releases is bounded. We show that global 
guarantee is a stronger privacy requirement than localized 
guarantee which has been adopted in previous works. We 
derive some theoretical results on this problem and discover 
that the anonymized group size is an important factor in pri- 
vacy protection. According to the anonymized group sizes, 
we propose two strategies for anonymization. Our empiri- 
cal study shows that these techniques are highly feasible and 
generate data publication of high utility. 

There are some promising future directions. In this pa- 
per, we study the global guarantee for transient sensitive 
values, meaning that the values can change freely. As a 
future plan, we will study the global guarantee when both 
transient sensitive values and permanent sensitive values are 
present. Permanent sensitive values are studied in [5] and 
refer to values that will be permanently linked to an individ- 
ual once it is linked to that individual. Intuitively, we can 
combine the technique here and that in [5] by forming the 
HD-compositions for holders and decoys, as well as forming 
anonymized groups based on the proper group size deter- 
mined by our strategies here for taking care of the transient 
values. However, the details are left for future studies. An- 
other direction is to extend the problem with the considera- 
tion of other background knowledge. 
If $n_k/n_{k,1}$ increases, the above equation decreases.



□ 



Here we give the proofs of some of the lemmas and theo- 
rems listed in the previous sections. 

Theorem 2. Consider that we published $k - 1$ tables where an equivalence class in $T^*_{k-1}$ containing $o$ is linked to $s_1$. Suppose we are to publish $T^*_k$ where an equivalence class in $T^*_k$ containing $o$ is also linked to $s_1$. If $n_{k-1}/n_{k-1,1} = n(k-1)$, then $p(o, s_1, k) > 1/\ell$.